Description

Background & Context

Thera Bank recently saw a steep decline in the number of its credit card users. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving its credit card services would cause the bank a loss, so the bank wants to analyze customer data, identify the customers who are likely to leave its credit card services, and understand the reasons why – so that the bank can improve in those areas.

As data scientists at Thera Bank, we need to come up with a classification model that will help the bank improve its services so that customers do not give up their credit cards.

We need to identify the best possible model that will give the required performance

Objective

Data Dictionary

Import the libraries

Loading Data

Data summary

Shape of the data

View top and bottom 5 records

Data Types

  • There are a total of 21 columns and 10,127 observations in the dataset.
  • Education_Level and Marital_Status have fewer than 10,127 non-null values, i.e. these columns have missing values.

Duplicates

Missing data

  • Other than Education_Level and Marital_Status (noted above), no columns have missing values

Let's check the number of unique values in each column

Unique values for Category columns

Unique values for Numerical Columns

  • Customer Age has only 45 unique values, i.e. customer ages fall within a fairly narrow range

Numerical column statistics

  • The mean of the Customer Age column is approximately 46, and the median is also 46, so the age distribution is roughly symmetric, with half of the customers under 46.
  • The Dependent Count column has a mean and median of ~2
  • The Months on Book column has a mean and median of 36 months. The minimum value is 13 months, showing that every customer in the dataset has been with the bank for at least one full year
  • Total Relationship Count has a mean and median of ~4
  • Credit Limit has a wide range of 1.4K to 34.5K, with a median of 4.5K, well below the mean of 8.6K
  • Total Transaction Count has a mean of ~65 and a median of 67

Categorical column statistics

  • The target variable Attrition Flag has an Existing-to-Attrited ratio of 83.9 : 16.1, so the dataset is imbalanced
  • ~93% of customers hold the Blue card
  • Income Category has the value abc for 10% of records, which we'll change to Unknown

Pre-EDA data processing

Dropping Id column

Treating missing values in Education Level and Marital Status

Note:   Missing value treatment should normally be done after splitting the data into Train, Validation and Test sets. In this case, however, the treatment is generic, since we are simply filling in Unknown, so it can be done on the overall dataset. The same reasoning applies to treating the Income Category value abc
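The treatment described in the note can be sketched as follows (column names follow the data dictionary; the function name is illustrative):

```python
import pandas as pd

def treat_missing_and_junk(df: pd.DataFrame) -> pd.DataFrame:
    """Fill NaNs in Education_Level / Marital_Status with 'Unknown'
    and replace the junk value 'abc' in Income_Category."""
    out = df.copy()
    for col in ["Education_Level", "Marital_Status"]:
        out[col] = out[col].fillna("Unknown")
    out["Income_Category"] = out["Income_Category"].replace("abc", "Unknown")
    return out
```

Because every affected value maps to the same constant, applying this before or after the split yields the same result, which is why the note allows treating the overall dataset.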

Treating Income Category = abc

Checking operation outcome

All the null data values have been treated along with the incorrect/junk data in Income Category column

Data type conversions

Converting the data type of the category variables from object/float to category

Standardizing column names

Removing the spaces from column names, and standardizing the column names to lower case
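A small sketch of this standardization step (the function name is illustrative):

```python
import pandas as pd

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Lower-case all column names and replace spaces with underscores."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.replace(" ", "_").str.lower()
    return out
```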

We'll move on to data analysis now.

Exploratory Data Analysis

Univariate Analysis

Numerical Feature Summary

The first step of univariate analysis is to check the distribution/spread of the data, primarily using histograms and box plots. Additionally, we'll plot each numerical feature on a violin plot and a cumulative distribution plot. For these four kinds of plots, we build the summary() function below to plot each of the numerical attributes. We also display the five-point summary for each feature.
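A hedged sketch of such a summary() helper, using plain matplotlib/numpy (the notebook's actual version may use seaborn and differ in layout):

```python
import numpy as np
import matplotlib.pyplot as plt

def summary(series, name="feature"):
    """Plot histogram, box plot, violin plot and empirical CDF for one
    numeric feature, and print its five-point summary."""
    x = np.asarray(series, dtype=float)
    x = x[~np.isnan(x)]
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    axes[0, 0].hist(x, bins=30)
    axes[0, 0].set_title(f"{name}: histogram")
    axes[0, 1].boxplot(x, vert=False)
    axes[0, 1].set_title(f"{name}: box plot")
    axes[1, 0].violinplot(x, vert=False)
    axes[1, 0].set_title(f"{name}: violin plot")
    xs = np.sort(x)                                  # empirical CDF
    axes[1, 1].plot(xs, np.arange(1, len(xs) + 1) / len(xs))
    axes[1, 1].set_title(f"{name}: cumulative distribution")
    fig.tight_layout()
    five = np.percentile(x, [0, 25, 50, 75, 100])    # min/Q1/median/Q3/max
    print(f"{name} five-point summary:", np.round(five, 2))
    return fig, five
```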

The data is normally distributed, with only 2 outliers on the right side (higher end)

Dependent Count is mostly 2 or 3

  • Most customers are on the books for 3 years
  • There are outliers on both lower and higher end

Most of the customers have 4 or more relationships with the bank

  • There are lower and higher end outliers for Months Inactive in the last 12 months
  • The lower end outliers are not concerning, since a value of 0 means the customer is always active; customers who are inactive for 5 or more months are the ones to worry about.
  • Contacts Count also shows outliers on both the lower and higher end.
  • Here, a low number of contacts between the bank and a customer is worth investigating

There are higher end outliers in Credit Limit. This might be because some customers are high-end clients.

Among the customers with a credit limit above 23K, ~87% earn $60K or more, and 90% hold a Blue or Silver card

Total revolving balance of 0 would mean the customer never uses the credit card

  • Average Open to Buy has many higher end outliers, which means there are customers who use only a very small portion of their credit limit
  • The data is right skewed

Outliers are on both higher and lower end

Total Transaction Amount has lots of higher end outliers

Outliers are on both higher and lower end

Average utilization is right skewed

Percentage on bar chart for Categorical Features

For the categorical variables, it is best to analyze them as a percentage of the total on bar charts. The function below takes a category column as input and plots a bar chart with the percentage on top of each bar.
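Such a function might be sketched as follows (illustrative; the notebook's version may differ in styling):

```python
import pandas as pd
import matplotlib.pyplot as plt

def labeled_barplot(series, name="feature"):
    """Bar chart of a categorical column with each bar's percentage of
    the total annotated on top of the bar."""
    counts = series.value_counts()
    pct = 100 * counts / counts.sum()
    fig, ax = plt.subplots(figsize=(6, 4))
    bars = ax.bar(counts.index.astype(str), counts.values)
    for bar, p in zip(bars, pct.values):
        ax.annotate(f"{p:.1f}%",
                    (bar.get_x() + bar.get_width() / 2, bar.get_height()),
                    ha="center", va="bottom")
    ax.set_title(name)
    return fig, pct
```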

  • There is high imbalance in the data, since the existing vs. attrited customer ratio is 84:16
  • The data is almost equally distributed between males and females
  • 31% of customers are Graduates
  • ~85% of customers are either Single or Married; 46.7% of all customers are Married
  • 35% of customers earn less than $40K and 36% earn $60K or more
  • ~93% of customers have the Blue card

Bi-variate Analysis

The goal of bi-variate analysis is to find inter-dependencies between features.

Target vs. All numerical columns

With outliers

Without outliers

Attrited customers have:

  • Lower total transaction amount
  • Lower total transaction count
  • Lower utilization ratio
  • Lower transaction count change Q4 to Q1
  • Higher number of times contacted with or by the bank

Target vs. All Categorical Columns

  • Attrition does not seem to be related with Gender
  • Attrition does not seem to be related with Education
  • Attrition does not seem to be related with Marital Status
  • Attrition does not seem to be related with Income Category
  • Platinum card holders appear to have a higher attrition tendency; however, since there are only 20 data points for Platinum card holders, this observation may be biased

Multi-variate Plots

Pairplot of all available numeric columns, hued by Attrition Flag

  • Clusters form with respect to attrition for the variables Total Revolving Balance, Total Amount Change Q4 to Q1, Total Transaction Amount, Total Transaction Count, and Total Transaction Count Change Q4 to Q1
  • There are strong correlations between a few columns as well, which we'll check in the correlation heatmap below.

Heatmap to understand correlations between independent and dependent variables

  • Credit Limit and Average Open to Buy have 100% collinearity
  • Months on Book and Customer Age have quite a strong correlation
  • Average Utilization Ratio and Total Revolving Balance also appear somewhat correlated
  • Attrition Flag does not have a strong correlation with any of the numeric variables
  • Customer churn appears to be uncorrelated with Customer Age, Dependent Count, Months on Book, Open to Buy, and Credit Limit; we'll remove these from the dataset

Data Preprocessing

Pre-processing steps:

  1. Split data into independent (feature) and target sets
  2. Data Split to Train, Test and Validation sets
  3. Standardize feature names
  4. Drop unnecessary columns (Client Number, Customer Age, Dependent Count, Months on Book, Open to Buy, Credit Limit)
  5. Missing Value/Incorrect Value treatment
  6. Encoding
  7. Scaling/Outlier treatment

Building data transformer functions and classes

First we'll build models individually after data pre-processing, and later we'll build an ML pipeline to run the end-to-end process of pre-processing and model building. We create a copy of the data for the first part.

Creating data copy

Defining the static variables

Data Type Conversions

Here we are converting Object data type to Category

Dependent and independent variables

Splitting the dataset into dependent and independent variable sets

Split data in Train, Validation and Test sets

Checking the ratio of labels in the target column for each of the data segments
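The split described above might look like this (an illustrative 70/15/15 stratified split; X, y and the exact ratios are assumptions, not necessarily the notebook's values):

```python
from sklearn.model_selection import train_test_split

def split_train_val_test(X, y, random_state=1):
    """70/15/15 stratified split: first carve off 30%, then halve it into
    validation and test, so all three sets keep the same label ratio."""
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=random_state)
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=random_state)
    return X_train, X_val, X_test, y_train, y_val, y_test
```

Stratifying on y is what keeps the ~84:16 Existing-to-Attrited ratio consistent across the three segments, which is what the label-ratio check above verifies.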

Data processing

Data pre-processing is one of the most important parts of the job before training a model. We need to impute missing values, fix any illogical data values in columns, convert category columns to numeric (either ordinal, or binary using one-hot encoding), and scale the data to deal with distribution skewness and outliers, before feeding the data to a model.

We use the pre-built transformation classes and the custom classes we created to fit on the training data only, and then transform the train, validation and test datasets. This is standard practice: it keeps information from the test and validation data out of the fitted transformers, preventing data leakage while training or validating the model.
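The fit-on-train, transform-everywhere pattern can be sketched as follows (StandardScaler stands in for the full set of transformers; the array names are stand-ins for the real splits):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(100, 20, size=(200, 3))   # stand-in for the real train split
X_val = rng.normal(100, 20, size=(50, 3))      # stand-in for validation

scaler = StandardScaler().fit(X_train)   # statistics come from train only...
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)        # ...so val/test never influence the fit
```

Calling fit only on the training split is exactly what prevents leakage: the validation and test sets are transformed with parameters they never contributed to.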

We are now all set to build, train and validate the model

Model Building Considerations

Model evaluation criterion:

Model can make wrong predictions as:

  1. Predicting a customer will attrite when the customer does not attrite - loss of resources spent on retention
  2. Predicting a customer will not attrite when the customer does attrite - loss of the opportunity to retain the customer

Which case is more important?

How do we reduce this loss, i.e., how do we reduce False Negatives?

Let's start by building different models using KFold and cross_val_score and tune the best model using RandomizedSearchCV
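As a minimal illustration of the KFold + cross_val_score step (synthetic imbalanced data and a decision tree stand in for the real dataset and models):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly the same 84:16 class imbalance
X, y = make_classification(n_samples=300, weights=[0.84], random_state=1)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(DecisionTreeClassifier(random_state=1), X, y,
                         cv=cv, scoring="recall")   # recall: minimize false negatives
print(scores.mean(), scores.std())
```

Scoring on recall, rather than accuracy, is what ties the cross-validation loop back to the false-negative objective above.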

Model Evaluation Functions - Scoring & Confusion Matrix

We are creating a few functions to score the models, show the confusion matrix

Function to Get Scores

Function to Draw Confusion Matrix

Function to Add Scores to Scoring Lists
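A hedged sketch of what these scoring helpers might look like (names and metric choices are illustrative; the notebook's own functions may differ):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

def get_scores(y_true, y_pred):
    """Return the headline metrics used to compare models."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }

def show_confusion_matrix(y_true, y_pred):
    """Confusion matrix; rows are true labels, columns predicted labels."""
    return confusion_matrix(y_true, y_pred)
```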

Building Models

We are building 7 models here: Bagging, Random Forest, Gradient Boosting, Ada Boosting, Extreme Gradient Boosting, Decision Tree, and Light Gradient Boosting.

Build and Train Models

We are building below 7 models:
 

  1. Bagging
  2. Random Forest Classification
  3. Gradient Boosting Machine
  4. Adaptive Boosting
  5. eXtreme Gradient Boosting
  6. Decision Tree Classification (Classification and Regression Trees - CART)
  7. Light Gradient Boosting Machine

 

Light GBM is a fast, distributed, high-performance gradient boosting framework based on decision tree algorithm, used for ranking, classification and many other machine learning tasks.
 

Since it is based on decision tree algorithms, it splits the tree leaf-wise with the best fit, whereas other boosting algorithms split the tree depth-wise or level-wise rather than leaf-wise. So when growing on the same leaf, Light GBM's leaf-wise algorithm can reduce more loss than the level-wise algorithm, resulting in better accuracy that is rarely achieved by existing boosting algorithms. Below is a diagrammatic representation by the makers of Light GBM to explain the difference clearly.

Source: towards data science

Comparing Models

  • The best model with respect to cross validation score and test recall is Light GBM
  • The next best models are XGBoost, GBM and AdaBoost respectively

Plotting the cross-validation result comparison

We are plotting the cross validation results for the 7 models in a Box plot, to check which models are potentially good.

It appears that Light GBM, XGBoost and GBM are the models with good potential. AdaBoost also looks good, given its higher end outlier performance score

Oversampling train data using SMOTE

Our dataset has a huge imbalance in target variable labels. To deal with such datasets, we have a few tricks up our sleeves, which we call Imbalanced Classification.

Imbalanced classification involves developing predictive models on classification datasets that have a severe class imbalance.

The challenge of working with imbalanced datasets is that most machine learning techniques will ignore the minority class, and in turn perform poorly on it, although typically performance on the minority class matters most, as is the case in our study here.

One approach to addressing imbalanced datasets is to oversample the minority class. The simplest approach involves duplicating examples in the minority class, although these examples don’t add any new information to the model. Instead, new examples can be synthesized from the existing examples. This is a type of data augmentation for the minority class and is referred to as the Synthetic Minority Oversampling Technique, or SMOTE for short.
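The notebook presumably uses imbalanced-learn's SMOTE for this; the core interpolation idea can be sketched with plain numpy and scikit-learn (a toy illustration, not the library implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Toy SMOTE: create n_new synthetic minority samples by interpolating
    between a randomly picked minority sample and one of its k nearest
    minority-class neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                    # idx[:, 0] is the point itself
    picks = rng.integers(0, len(X_min), n_new)       # which samples to expand
    neigh = idx[picks, rng.integers(1, k + 1, n_new)]  # one of their k neighbours
    gap = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_min[picks] + gap * (X_min[neigh] - X_min[picks])
```

Because each synthetic point lies on the segment between two real minority samples, the new data adds variety without straying outside the minority region, which is what distinguishes SMOTE from simple duplication.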

Build Models with Oversampled Data

Build and Train Models

We are building and training the same 7 models as before. We are however going to use the over-sampled training data for training the models.

Comparing Models

  • The best 4 models with respect to validation recall and cross validation score, are as follows:
  1. Light GBM trained with over/up-sampled data
  2. GBM trained with over/up-sampled data
  3. AdaBoost trained with over/up-sampled data
  4. XGBoost trained with over/up-sampled data

Undersampling train data using Random Under Sampler

Undersampling is another way of dealing with imbalance in the dataset.

Random undersampling involves randomly selecting examples from the majority class and deleting them from the training dataset until a balanced dataset is created.
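Random undersampling reduces to sampling the minority-class count of rows from each class; a pandas sketch of the idea (imbalanced-learn's RandomUnderSampler does roughly this, with more options):

```python
import pandas as pd

def random_undersample(X: pd.DataFrame, y: pd.Series, seed=1):
    """Randomly keep only n_minority rows per class so the result is balanced."""
    n = y.value_counts().min()                       # size of the minority class
    idx = y.groupby(y).sample(n, random_state=seed).index  # n rows from each class
    return X.loc[idx], y.loc[idx]
```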

Build Models with Undersampled Data

Build and Train Models

We are again building the same 7 models as before and training with the undersampled dataset, and use the validation dataset to score the models.

Comparing Models

  • The 4 best models are:
  1. XGBoost trained with undersampled data
  2. AdaBoost trained with undersampled data
  3. Light GBM trained with undersampled data
  4. GBM trained with undersampled data  

We will now try to tune these 4 models using Random Search CV

Choice of models for tuning

  1. XGBoost with down-sampling has the best validation recall of 96.3%, along with a 95% cross validation score on train and 0.99 AUC, which means it has a high chance of performing very well on unseen data. There is a bit of over-fitting, which I expect tuning to resolve.

  2. AdaBoost generalizes very well: it is neither over-fitting nor biased, AUC is 0.985, the cross validation score on train is 93%, and recall on the validation set is the same as XGBoost's (96.3%). I expect to improve the model (~94% on the validation set) via tuning.

  3. Light GBM works really well in all aspects, but has a slight over-fitting problem, which I expect tuning to resolve. Accuracy on validation is 94%, the cross validation score on train is 95%, recall on validation is ~96%, and AUC is 0.99. This looks like a very promising model.

  4. GBM is not overfitting, and it suffers from neither high bias nor high variance. Recall on validation is ~96%, accuracy on validation ~94%, AUC ~0.99, and the cross validation score on train ~95%. This would be my top choice, because none of the training scores are 100%, meaning it is not overfitting by trying to explain every single aspect of the training data.

Model Tuning using RandomizedSearchCV

Typically a hyperparameter has a known effect on a model in the general sense, but it is not clear how to best set a hyperparameter for a given dataset. Further, many machine learning models have a range of hyperparameters and they may interact in nonlinear ways.

As such, it is often required to search for a set of hyperparameters that result in the best performance of a model on a dataset. This is called hyperparameter optimization, hyperparameter tuning, or hyperparameter search.

An optimization procedure involves defining a search space. This can be thought of geometrically as an n-dimensional volume, where each hyperparameter represents a different dimension and the scale of each dimension is the set of values that the hyperparameter may take on, such as real-valued, integer-valued, or categorical.

Search Space: Volume to be searched where each dimension represents a hyperparameter and each point represents one model configuration. A point in the search space is a vector with a specific value for each hyperparameter value. The goal of the optimization procedure is to find a vector that results in the best performance of the model after learning, such as maximum accuracy or minimum error.

A range of different optimization algorithms may be used, although two of the simplest and most common methods are random search and grid search.

Random Search: Define a search space as a bounded domain of hyperparameter values and randomly sample points in that domain.  

Grid Search: Define a search space as a grid of hyperparameter values and evaluate every position in the grid.
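A minimal random-search example (synthetic data, illustrative parameter ranges, and recall as the scoring metric, as discussed above):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the (undersampled) training data
X, y = make_classification(n_samples=300, weights=[0.84], random_state=1)

param_dist = {                       # bounded domains to sample from
    "n_estimators": randint(50, 150),
    "max_depth": randint(2, 5),
    "learning_rate": [0.05, 0.1, 0.2],
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=1),
    param_distributions=param_dist,
    n_iter=5,                        # 5 random points in the search space
    cv=3, scoring="recall", random_state=1, n_jobs=-1)
search.fit(X, y)
best_model = search.best_estimator_  # refit on the full data with best params
```

Each sampled point is one model configuration (a vector of hyperparameter values), and `best_estimator_` corresponds to the point with the highest cross-validated recall.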

Tuning XGBOOST with Down Sampled data

Finding best parameter for high recall using Random Search with cross validation

Building the model with the resulted best parameters

Get scores

Confusion matrix on validation

Tuning AdaBoost with Down Sampled data

Finding best parameter for high recall using Random Search with cross validation

Building the model with the resulted best parameters

Get scores

Confusion matrix on validation

Tuning Light GBM with Down-Sampled data

Finding best parameter for high recall using Random Search with cross validation

Building the model with the resulted best parameters

Get scores

Confusion matrix on validation

Tuning GBM with Down Sampled data

Finding best parameter for high recall using Random Search with cross validation

Building the model with the resulted best parameters

Get scores

Confusion matrix on validation

Comparing Models

Final Model Selection

  • The XGBoost model with hyperparameter tuning, trained on the undersampled dataset, has the best recall on the validation set (~99%), but its accuracy is lower than the naive baseline accuracy (i.e., classifying everyone as a non-attriting customer). Thus, we are not selecting this model as the final model
     
  • The hyperparameter-tuned GBM trained on the undersampled dataset provides validation recall of ~97%, with validation accuracy of ~94%, precision of ~74%, validation AUC of ~99%, and a cross validation mean of 96%. The model suffers from neither bias nor variance. We are selecting the GBM tuned with down-sampling as our final model

Check Test Data on GBM Tuned and Trained with Downsampled Data

Feature Importance

Test scores

Let's check the performance of the model on Test (unseen) dataset.

The performance of the model on the test data is very similar to its performance on the validation dataset.

Confusion Matrix

ROC-AUC Curve

ROC AUC characteristic is important to understand how good the model is.

If the model is really good in identifying the classes, the Area Under Curve is really high, close to 1.

If the model cannot distinguish the classes well, the Area Under Curve is low, close to 0.5.

Our model appears to be really good, since the AUC is almost 1.
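The curve and AUC can be computed with scikit-learn roughly like this (synthetic data and logistic regression stand in for our tuned GBM; note that AUC is computed from predicted probabilities, not hard labels):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]   # class-1 probabilities, not labels
fpr, tpr, _ = roc_curve(y_te, proba)      # points of the ROC curve
auc = roc_auc_score(y_te, proba)          # area under that curve
```

Plotting `tpr` against `fpr` gives the ROC curve; the closer it hugs the top-left corner, the closer `auc` gets to 1.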

Productionizing the model

Now that we have finalized our model, we'll build a model pipeline to streamline all the steps of model building. We'll start with the initial dataset and proceed through the pipeline-building steps.

A Machine Learning (ML) pipeline represents the sequence of steps, including data transformation and prediction, through which data passes. The outcome of the pipeline is a trained model which can be used for making predictions. sklearn.pipeline is a Python implementation of the ML pipeline. Instead of going through the model fitting and data transformation steps for the training and test datasets separately, we can use sklearn.pipeline to automate these steps. Here is a diagram representing the pipeline for training our machine learning model with supervised learning, and then using test data to predict the labels.
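A condensed sketch of such a pipeline (the toy frame, column names and model choice are illustrative, not the notebook's exact configuration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical minimal frame standing in for the real dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "total_trans_ct": rng.integers(10, 140, 300),
    "card_category": rng.choice(["Blue", "Silver"], 300),
    "attrition_flag": rng.integers(0, 2, 300),
})
X, y = df.drop(columns="attrition_flag"), df["attrition_flag"]

pre = ColumnTransformer([                              # per-column preprocessing
    ("num", StandardScaler(), ["total_trans_ct"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["card_category"]),
])
pipe = Pipeline([("preprocess", pre),
                 ("model", GradientBoostingClassifier(random_state=1))])
pipe.fit(X, y)            # fits transformers and model in one call
preds = pipe.predict(X)   # transform + predict in one call
```

With the steps bundled this way, calling `fit` on training data and `predict` on test data automatically applies the same fitted transformers, which enforces the leakage-free pattern used earlier.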

Set Static variables

Dependent and independent variables

Split data in Train, Validation and Test sets

Undersampling the training data, since that generalized the model really well

Data processing Steps

Build the pipeline

Score the pipeline using test data

Accuracy

Recall

Actionable Insights and Recommendations

Important Features to Understand Customer Credit Card Churn:

  1. Total Transaction Count
  2. Total Transaction Amount
  3. Total Revolving Balance
  4. Total Amount Change Q4 to Q1
  5. Total Count Change Q4 to Q1
  6. Total Relationship Count

Feature Correlation with Attrition Flag:

Recommendations:

  1. The bank should connect with the customer more often to increase engagement and provide various offers and schemes to build stronger relationships.
  2. The bank should offer cashback schemes on credit cards, encouraging customers to use the credit card more frequently.
  3. The bank should offer credit limit increases for customers who are regularly using the credit card, which will likely increase transaction amounts.
  4. Offering 0% interest EMI on credit cards can encourage customers to purchase higher-cost items and convert the expenditure to EMI, thereby increasing transaction amounts and counts, and ensuring the balance revolves nicely.
  5. The bank can introduce specialized credit cards for online shopping (with cashback offers) or online food ordering to encourage frequent use.
  6. Using our model, we can predict customers likely to attrite. Based on predicted probability, the top 20-30% of customers can be targeted with offers such as credit card promotions and credit limit increases to help retain them.